This report is based on Mini-Challenge 2 (MC2) from the VAST Challenge 2021. Here is the background:
GAStech, an oil-products company from Tethys, has expanded into the island country of Kronos and built good relations with the local government. While GAStech has earned considerable profit there, its operations have also affected the local natural environment. At the beginning of 1997, residents of Elodis, an agricultural town near the capital of Kronos, began to notice an abnormal increase in illnesses such as cancer and birth defects, which they believed was related to GAStech's business. The residents founded the POK (Protectors of Kronos) organization to protect the ecological environment of Kronos and prevent further deterioration.
During GAStech's celebration party in 2014, several employees disappeared without explanation, and the POK organization was suspected of involvement. We now have the employees' credit card and loyalty card transaction data, as well as their GPS records, to identify anomalies. The purpose of this report is therefore to analyze the consumption behavior of employees at different times and places, visualize the data with RStudio, screen for suspicious behavior, match employees with their various cards, and draw suggestions and conclusions.
The R language was originally developed by Robert Gentleman and Ross Ihaka; RStudio is an integrated development environment (IDE) for R. It is a standalone open-source project that integrates many powerful programming tools into an intuitive, easy-to-learn interface. Since this visual analysis is built in RStudio, we first introduce some of the R packages used.
The tidyverse is a collection of R packages for solving data science challenges, encompassing the repeated tasks at the heart of every data science project: data import, tidying, manipulation, visualization, and programming. Its primary goal is to facilitate a conversation between a human and a computer about data (Wickham, Averick, et al., 2019).
The tidyverse also provides the ggplot2 (Wickham, 2016) package for visualization. ggplot2 is a system for declaratively creating graphics, based on The Grammar of Graphics (Wilkinson, 2005). Data transformation is supported by the core dplyr (Wickham et al., 2019a) package. dplyr provides verbs that work with whole data frames, such as mutate() to create new variables and filter() to find observations matching given criteria.
R can be used for visualization and supports many types of charts, presenting most of a dataset's information in a simple plot. The ggplot2 library supports visualization for various formats of data and for machine learning models. Other libraries that support visualization in R include ggiraph, digraph, ggVis, and so on (Srinivasa, Siddesh, & Srinidhi, 2018).
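As a quick illustration of the dplyr verbs mentioned above, here is a small hypothetical example (the toy data frame is invented for illustration, not one of the MC2 datasets):

```r
library(dplyr)

# A small invented data frame of transactions
toy <- tibble(
  location = c("Cafe A", "Cafe A", "Diner B"),
  price    = c(4.10, 9.60, 16.50)
)

toy %>%
  mutate(price_with_tip = price * 1.1) %>%  # mutate() creates a new variable
  filter(price > 5)                         # filter() keeps matching rows
```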
MC2 includes four CSV files and a tourist map of Abila with locations of interest identified.
– Car-assignments: 44 records of vehicle assignments to GAStech employees.
– Cc_data: 1,490 credit and debit card transaction records at various locations.
– Loyalty_data: 1,392 loyalty card transaction records of GAStech's employees.
– Gps: latitude and longitude records tracking each employee's movements.
The first step is to install all the packages we will use in the follow-up study. Here is a brief introduction to some of them.
– tidyverse: used to manipulate and clean the original datasets; includes ggplot2 (which is mainly for data visualization), dplyr, tidyr, and purrr.
– ggiraph: makes ggplot2 graphics interactive.
– plotly: an R graphing library for interactive, publication-quality graphs.
– patchwork: makes it ridiculously simple to combine separate ggplots into the same graphic.
– visNetwork: an R package for network visualization.
– lubridate: makes up for R's awkward handling of date-time data and makes it much easier.
– magrittr: aims to decrease development time and improve readability and maintainability of code.
– tmap: generates thematic maps with great flexibility.
– raster: provides classes and functions to manipulate geographic data in 'raster' format.
– igraph: a collection of libraries for creating and manipulating graphs and analyzing networks.
packages <- c('ggiraph', 'plotly', 'DT', 'patchwork',
              'tidyverse', 'visNetwork', 'clock', 'lubridate',
              'raster', 'sf', 'tmap', 'rgdal', 'dplyr', 'ggraph',
              'igraph', 'tidygraph', 'magrittr', 'networkD3', 'ggalluvial')
for (p in packages) {
  if (!require(p, character.only = TRUE)) {
    install.packages(p)
  }
  library(p, character.only = TRUE)
}
Import the four datasets.
LoyaltyCard <- read.csv("D:/Study/Visual Analytics/Assignment/MC2/loyalty_data.csv")
CC <- read.csv("D:/Study/Visual Analytics/Assignment/MC2/cc_data.csv")
GPS <- read.csv("D:/Study/Visual Analytics/Assignment/MC2/gps.csv")
CarTrack <- read.csv("D:/Study/Visual Analytics/Assignment/MC2/car-assignments.csv")
Here is a rough view of the information in each dataset.
glimpse(LoyaltyCard)
Rows: 1,392
Columns: 4
$ timestamp <chr> "01/06/2014", "01/06/2014", "01/06/2014", "01/06/~
$ location <chr> "Brew've Been Served", "Brew've Been Served", "Ha~
$ price <dbl> 4.17, 9.60, 16.53, 11.51, 12.93, 4.27, 11.20, 15.~
$ loyaltynum <chr> "L2247", "L9406", "L8328", "L6417", "L1107", "L40~
glimpse(CC)
Rows: 1,490
Columns: 4
$ timestamp <chr> "01/06/2014 07:28", "01/06/2014 07:34", "01/06/20~
$ location <chr> "Brew've Been Served", "Hallowed Grounds", "Brew'~
$ price <dbl> 11.34, 52.22, 8.33, 16.72, 4.24, 4.17, 28.73, 9.6~
$ last4ccnum <int> 4795, 7108, 6816, 9617, 7384, 5368, 7253, 4948, 9~
glimpse(GPS)
Rows: 685,169
Columns: 4
$ Timestamp <chr> "01/06/2014 06:28:01", "01/06/2014 06:28:01", "01/~
$ id <int> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
glimpse(CarTrack)
Rows: 44
Columns: 5
$ LastName <chr> "Calixto", "Azada", "Balas", "Barranc~
$ FirstName <chr> "Nils", "Lars", "Felix", "Ingrid", "I~
$ CarID <int> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12~
$ CurrentEmploymentType <chr> "Information Technology", "Engineerin~
$ CurrentEmploymentTitle <chr> "IT Helpdesk", "Engineer", "Engineer"~
We can see that some columns' data types are incorrect; we will change them in the following steps.
The first thing to do is to check for missing values. We find that only the CarTrack dataset has missing values, and only in the CarID column.
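The exact check is not shown in the source; the output that follows could be produced by a check along these lines, using which(is.na(...)) on the relevant column of each dataset:

```r
# Row indices containing missing values; integer(0) means none were found
which(is.na(LoyaltyCard$loyaltynum))
which(is.na(CC$last4ccnum))
which(is.na(GPS$id))
which(is.na(CarTrack$CarID))
```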
integer(0)
integer(0)
integer(0)
[1] 36 37 38 39 40 41 42 43 44
If a check returns integer(0), there are no missing values. We can conclude that none of the datasets have missing values except CarTrack. Going back to that dataset, we find that the missing values are all CarIDs of drivers in the GAStech company, which are not very useful for our subsequent research, so we choose to delete those rows with the na.omit function.
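A minimal sketch of that deletion step, assuming the CarTrack data frame imported above:

```r
# na.omit() drops every row containing a missing value;
# in CarTrack only CarID has NAs, so only those rows are removed
CarTrack <- na.omit(CarTrack)
```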
From our earlier glimpse of the data, the data type of the timestamp column is incorrect: it should be in date-time format but was read as character. Here we use date_time_parse to convert it, and we also extract the hour and the day as new columns in CC and LoyaltyCard respectively.
For CC dataset:
CC$timestamp <- date_time_parse(as.character(CC$timestamp),
zone = "",
format = "%m/%d/%Y %H:%M")
CC$weekday = wday(CC$timestamp,
label = TRUE,
abbr = FALSE)
CC$hour <- as.numeric(get_hour(CC$timestamp)) #extract hour as new column
For LoyaltyCard dataset:
LoyaltyCard$timestamp <- date_time_parse(as.character(LoyaltyCard$timestamp),
zone = "",
format = "%m/%d/%Y")
LoyaltyCard$Weekday = wday(LoyaltyCard$timestamp,
label = TRUE,
abbr = FALSE)
LoyaltyCard$day <- as.factor(get_day(LoyaltyCard$timestamp)) #extract day as new column
In the CarTrack dataset, the first name and last name are stored in two separate columns, so we combine them into a single Name column with the paste function.
CarTrack$Name <- paste(CarTrack$FirstName, CarTrack$LastName)
CarTrack <- CarTrack %>%
select(CarID,CurrentEmploymentTitle,CurrentEmploymentType,Name)
Because one of the location names is in a Greek-influenced format, we need to change it into plain English, and we also delete all single quotes, since they could cause problems in the following code.
LoyaltyCard$location <- gsub("'", '', LoyaltyCard$location)
LoyaltyCard$location[LoyaltyCard$location=="Katerina’s Café"]<-"Katerina Cafe"
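The CC dataset presumably needs the same location clean-up so that the two datasets can later be matched on location; this step is not shown in the source, but a sketch mirroring the LoyaltyCard code would be:

```r
# Apply the same location clean-up to the credit card data
CC$location <- gsub("'", '', CC$location)
CC$location[CC$location == "Katerina’s Café"] <- "Katerina Cafe"
```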
Create a new dataset named "LoyaltyCard_3" and use the group_by function to count the number of transactions at each location during this period. We will use this dataset for further visualization.
LoyaltyCard_3 <- LoyaltyCard %>%
select(location) %>%
group_by(location) %>%
summarize(weight =n()) %>%
arrange(desc(weight))
glimpse(LoyaltyCard_3)
Rows: 33
Columns: 2
$ location <chr> "Katerina Cafe", "Hippokampos", "Guys Gyros", "Brew~
$ weight <int> 195, 155, 146, 140, 84, 80, 71, 66, 60, 45, 42, 37,~
First we visualize the total number of loyalty card transactions over the 14 days at each location in Kronos.
Bar <- plot_ly(data=LoyaltyCard_3,
x = ~weight,
y = ~location,
type = "bar",
text = ~paste("location:", location,
"<br>Number of times:", weight),
marker = list(color = 'rgb(158,202,225)',
line = list(color = 'rgb(8,48,107)', width = 1.5))) %>%
layout(xaxis = list(title = 'Number of times'),
yaxis = list(title = "",
categoryorder = "array",
categoryarray = ~weight),
title ='Total number of loyalty card transactions')
Bar
We find that the three most popular places among employees paying with a loyalty card are a cafe and two restaurants: Katerina's Cafe, Hippokampos, and Guy's Gyros. It seems the employees are in the habit of drinking coffee regularly, and we may speculate that they like to meet customers in a cafe or have a relaxed conversation there.
Then we group the data by weekday to show the changes over each day, and name the result LoyaltyCard_4.
LoyaltyCard_4 <- LoyaltyCard %>%
select(location,Weekday) %>%
group_by(Weekday,location) %>%
summarize(Count=n())
glimpse(LoyaltyCard_4)
Rows: 157
Columns: 3
Groups: Weekday [7]
$ Weekday <ord> Sunday, Sunday, Sunday, Sunday, Sunday, Sunday, Sun~
$ location <chr> "Abila Zacharo", "Ahaggo Museum", "Alberts Fine Clo~
$ Count <int> 2, 2, 4, 8, 5, 2, 2, 16, 20, 2, 17, 2, 9, 6, 6, 15,~
Plotly is a very useful package for making interactive, publication-quality graphs. We will use plotly to make an interactive bar chart, in which a different day can be selected in the control panel.
BAR3 <- plot_ly(data=LoyaltyCard_4,
x = ~Count,
y = ~location,
color = ~location,
frame = ~Weekday,
text = ~paste("Day:", Weekday,
"<br>Count:", Count,
"<br>location:",location),
hoverinfo = "text",
type = 'bar'
) %>%
layout(title ='Most popular location of loyalty card transaction by weekday')
BAR3
Using the control panel, we can see that the most popular place on Sunday is Hippokampos, followed by Katerina's Cafe. On Monday it is Brew've Been Served, followed by Katerina's Cafe, which is also the most popular place on Tuesday and Thursday. On Wednesday and Friday the top spots combine the Monday and Tuesday patterns: Brew've Been Served and Katerina's Cafe.
So we can conclude that the most popular locations for GASTech employees are Brew've Been Served and Katerina's Cafe; they regularly choose these two places to rest and eat.
First we create a new dataset, named CC_4, which includes the location, hour, and the number of visits GASTech employees made to each location.
CC_4 <- CC %>%
select(location,hour) %>%
group_by(location) %>%
summarize(Count=n()) %>%
arrange(desc(Count))
Create the basic bar chart for initial visualization.
Bar4 <- plot_ly(data=CC_4,
x = ~Count,
y = ~location,
type = "bar",
text = ~paste("location:", location,
"<br>Number of times:", Count),
marker = list(color = 'rgb(158,202,225)',
line = list(color = 'rgb(8,48,107)', width = 1.5))) %>%
layout(xaxis = list(title = 'Number of times'),
yaxis = list(title = 'location',
categoryorder = "array",
categoryarray = ~Count)) %>%
layout(title ='Total number of credit card transactions')
Bar4
We reach the same conclusion as with the loyalty card data: the most popular places for employees are cafes and eating places. Next we create a heatmap to see the changes by hour.
CC_5 <- CC %>%
select(location,hour) %>%
group_by(location,hour) %>%
summarize(Count=n())
Here we use a heatmap for the visualization.
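The heatmap code itself is not shown here; a minimal ggplot2 sketch over the CC_5 summary created above might look like this:

```r
library(ggplot2)

# Heatmap of transaction counts by location and hour of day
ggplot(CC_5, aes(x = hour, y = location, fill = Count)) +
  geom_tile() +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(title = "Credit card transactions by location and hour",
       x = "Hour of day", y = "") +
  theme_minimal()
```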
GASTech employees show a clear daily pattern. At 7–8 am, Brew've Been Served and Hallowed Grounds are the most popular places. The employees then work until around 12 pm to 1 pm, when they have lunch. At 8 pm, presumably after leaving work, they go to cafes and restaurants to relax and eat.
First to import the map file and create the initial map.
bgmap <- raster("D:/Study/Visual Analytics/Mavis-yuuu/Mavis_blog_ISSS608/_posts/2021-07-12-assignment/MC2-tourist.tif")
bgmap
class : RasterLayer
band : 1 (of 3 bands)
dimensions : 1595, 2706, 4316070 (nrow, ncol, ncell)
resolution : 3.16216e-05, 3.16216e-05 (x, y)
extent : 24.82419, 24.90976, 36.04499, 36.09543 (xmin, xmax, ymin, ymax)
crs : +proj=longlat +datum=WGS84 +no_defs
source : MC2-tourist.tif
names : MC2.tourist
values : 0, 255 (min, max)
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255)

Then import the Abila road shapefile and convert the GPS data into a spatial format.
Abila_st <- st_read(dsn = "./Geospatial",
layer = "Abila")
Reading layer `Abila' from data source
`D:\Study\Visual Analytics\Mavis-yuuu\Mavis_blog_ISSS608\_posts\2021-08-07-assignment-mc2\Geospatial'
using driver `ESRI Shapefile'
Simple feature collection with 3290 features and 9 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 24.82401 ymin: 36.04502 xmax: 24.90997 ymax: 36.09492
Geodetic CRS: WGS 84
gps <- read_csv("D:/Study/Visual Analytics/Assignment/MC2/gps.csv")
glimpse(gps)
Rows: 685,169
Columns: 4
$ Timestamp <chr> "01/06/2014 06:28:01", "01/06/2014 06:28:01", "01/~
$ id <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35, 35~
$ lat <dbl> 36.07623, 36.07622, 36.07621, 36.07622, 36.07621, ~
$ long <dbl> 24.87469, 24.87460, 24.87444, 24.87425, 24.87417, ~
gps$Timestamp <- date_time_parse(gps$Timestamp,
zone = "",
format = "%m/%d/%Y %H:%M:%S")
gps$weekday = wday(gps$Timestamp,
label = TRUE,
abbr = FALSE)
gps$id <- as_factor(gps$id)
gps_sf <- st_as_sf(gps,
coords = c("long", "lat"),
crs= 4326)
gps_sf
Simple feature collection with 685169 features and 3 fields
Geometry type: POINT
Dimension: XY
Bounding box: xmin: 24.82509 ymin: 36.04802 xmax: 24.90849 ymax: 36.08996
Geodetic CRS: WGS 84
# A tibble: 685,169 x 4
Timestamp id weekday geometry
* <dttm> <fct> <ord> <POINT [°]>
1 2014-01-06 06:28:01 35 Monday (24.87469 36.07623)
2 2014-01-06 06:28:01 35 Monday (24.8746 36.07622)
3 2014-01-06 06:28:03 35 Monday (24.87444 36.07621)
4 2014-01-06 06:28:05 35 Monday (24.87425 36.07622)
5 2014-01-06 06:28:06 35 Monday (24.87417 36.07621)
6 2014-01-06 06:28:07 35 Monday (24.87406 36.07619)
7 2014-01-06 06:28:09 35 Monday (24.87391 36.07619)
8 2014-01-06 06:28:10 35 Monday (24.87381 36.07618)
9 2014-01-06 06:28:11 35 Monday (24.87374 36.07617)
10 2014-01-06 06:28:12 35 Monday (24.87362 36.07618)
# ... with 685,159 more rows
Derive the path data.
gps_path <- gps_sf %>%
group_by(id, weekday) %>%
summarize(m = mean(Timestamp),
do_union=FALSE) %>%
st_cast("LINESTRING")
gps_path
Simple feature collection with 260 features and 3 fields
Geometry type: LINESTRING
Dimension: XY
Bounding box: xmin: 24.82509 ymin: 36.04802 xmax: 24.90849 ymax: 36.08996
Geodetic CRS: WGS 84
# A tibble: 260 x 4
# Groups: id [40]
id weekday m geometry
<fct> <ord> <dttm> <LINESTRING [°]>
1 1 Sunday 2014-01-15 14:31:58 (24.88259 36.06643, 24.8824 36.~
2 1 Monday 2014-01-09 17:37:44 (24.88258 36.06646, 24.88259 36~
3 1 Tuesday 2014-01-11 01:34:40 (24.87957 36.04803, 24.87957 36~
4 1 Wednesd~ 2014-01-11 12:30:08 (24.88265 36.06643, 24.88266 36~
5 1 Thursday 2014-01-13 07:50:33 (24.88261 36.06646, 24.88257 36~
6 1 Friday 2014-01-13 13:58:58 (24.88265 36.0665, 24.88261 36.~
7 1 Saturday 2014-01-15 01:53:26 (24.88258 36.06651, 24.88246 36~
8 2 Sunday 2014-01-16 03:54:17 (24.86041 36.08542, 24.86045 36~
9 2 Monday 2014-01-10 05:54:49 (24.86038 36.08546, 24.86038 36~
10 2 Tuesday 2014-01-10 16:46:43 (24.86041 36.08549, 24.86037 36~
# ... with 250 more rows
Visualize the paths by each id.
gps_path_selected <- gps_path %>%
arrange(desc(id))
tmap_mode("view")
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_path_selected) +
tm_facets(by = "id",ncol = 1) +
tm_layout(legend.show=FALSE) +
tm_lines(col = "weekday", lwd = 7)
We can also use this code to show one employee's detailed tracks by weekday.
gps_path_selected <- gps_path %>%
filter(id == "28")
tmap_mode("view")
tm_shape(bgmap) +
tm_rgb(bgmap, r = 1,g = 2,b = 3,
alpha = NA,
saturation = 1,
interpolate = TRUE,
max.value = 255) +
tm_shape(gps_path_selected) +
tm_facets(by = "weekday",ncol = 1) +
tm_layout(legend.show=FALSE) +
tm_lines(col = "weekday", lwd = 7)